(54) Method and apparatus for configurable cache coherency



(57) A coherency controller for configurable caches. 
A base microprocessor design accommodates system 
configurations both with and without L2 cache tag and 
data arrays installed. Second level cache control logic 
exists within the microprocessor chip, and when the ex- 
ternal second level cache tag and data arrays are re- 
moved their inputs to the microprocessor are tied to an 
inactive state. A configuration switch is set in the second 
level cache controller that causes snoop requests from 
a system bus to get reflected onto a first level cache 
snooping path. The first level cache status is then fed 
back to the second level cache controller, in a manner 
consistent with the timing required for support of a sec- 
ond level cache search, and fed into the second level 
cache status signal generation logic, effectively making 
the second level cache controller believe that the sec- 
ond level cache still exists for snooping. All other actions 
remain the same in the second level cache controller 
providing an effective and simple method for supporting 
snooping bus protocols. A result is that now every bus
request snoops the first level cache without knowledge
of the presence of an L2 cache. This environment is provided
to support entry level single processor configurations
where the snooping requests only amount to input/output
traffic.
Description

Field of the Invention 

[0001] This invention generally relates to caches for
computer systems, such as set associative caches and
direct-mapped caches, and more particularly to providing
data coherency in systems including processors
having two levels of cache and those having a single
level cache.

Background Art

[0002] Microprocessor development projects are
costly in terms of time and money. As such, there is a
strong desire to make each of these design efforts versatile
enough to cover the entire planned product range.
The problem stems from the fact that the entry level machines
must favor low cost over performance, while the
upper end of the product range must favor
performance.

[0003] A common way to achieve this involves making
the microprocessor core "programmable." This translates
into a need to make it work with different external
environments. To be specific, it is designed to accommodate
multiple levels of cache buffer, with the first built
into the microprocessor core, and the second external
to the chip. This external second level cache can be
quite expensive and unnecessary for the entry level
models. However, it is absolutely necessary to the high-end,
which requires that multiple processors with their
external second level caches be coupled together and
cooperatively work on programs and data.
[0004] A cache is a high speed buffer which holds recently
used memory data. Because programs exhibit locality
of reference, most data accesses may be satisfied
in a cache, in which case slower accesses
to bulk memory can be avoided.
[0005] A typical shared memory multiprocessor system
implements a coherency mechanism for its memory
subsystem. This memory subsystem contains one or
more levels of cache memory associated with a local
processor. These processor/cache subsystems share a
bus connection to main memory. A snooping protocol is
adopted where certain accesses to memory require that
processor caches in the system be searched for the
most recent (modified) version of requested data.
[0006] In accordance with an exemplary high-end
system, a two level cache subsystem is implemented in
which the level 2 (L2) cache line size is some power of 2
larger than the level 1 (L1) cache line size. Both caches implement
writeback policies, and L1 is set-associative. L1 is subdivided
into sublines which track which portions of the
cache line contain modified data. Snoop requests from
the bus are received at L2 and, if appropriate, the request
is also forwarded on to L1. The snoop request forwarded
to L1, however, requires accessing the L1 directory
for all of the consecutive L1 cache entries which
may contain data associated with the L2 cache line.
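
The relationship between the two line sizes can be illustrated with a minimal sketch. The 32-byte and 128-byte line sizes and the function name are assumptions chosen for illustration; the text only requires that the L2 line size be a power-of-2 multiple of the L1 line size.

```python
def l1_lines_for_l2_line(addr, l1_line_size=32, l2_line_size=128):
    """List the L1 line base addresses covered by the L2 line holding addr.

    The default line sizes are illustrative assumptions, not values
    taken from the description.
    """
    if l2_line_size % l1_line_size != 0:
        raise ValueError("L2 line size must be a multiple of L1 line size")
    l2_base = addr - (addr % l2_line_size)
    # A snoop forwarded to L1 has to check every one of these entries.
    return [l2_base + off for off in range(0, l2_line_size, l1_line_size)]
```

With these assumed sizes, a snoop that hits one L2 line may require four consecutive L1 directory accesses.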
[0007] Two fundamental ways to manage the con- 
tents of multiple levels of caches exist: 

1. Allow unique: the first level and second level 
caches are allowed to have unique data. 

2. Force inclusion: the first level of cache is required 
to be a subset of the second level of cache. 

[0008] In a coherent shared memory multiprocessing 
environment, each time a processor issues a request 
for memory data, the other processors' caches may 
need to be searched for copies of that data, depending 
upon the type of request. Also, in a system with a single 
processor, memory coherency needs to be maintained 
with other devices that access memory, such as I/O 
processors. 

[0009] Consider a first exemplary system including a
plurality of high-level processors, such that private first
and second level caches exist for each processor in the
multiprocessing system. Assume also that a snooping
bus protocol is used for the multiprocessor memory hierarchy
and that cache blocks may exist in cache in the
modified, exclusive, shared, or invalid state (MESI protocol).
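
As a point of reference, the conventional textbook MESI response of one cache line to a snooped request can be sketched as follows. This is not taken from the description, which does not enumerate its bus commands; the `invalidating` flag and the function name are assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def snoop(state, invalidating):
    """Return (next_state, must_supply_data) for a snooped bus request.

    invalidating=True models a request for ownership; False models a read.
    These are generic MESI transitions, assumed for illustration.
    """
    if state is State.INVALID:
        return State.INVALID, False
    # Only a modified line holds the most recent copy of the data.
    supplies = state is State.MODIFIED
    if invalidating:
        return State.INVALID, supplies
    return State.SHARED, supplies
```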

1. ALLOW UNIQUE: In a first exemplary system operating
in an "allow unique" environment where the
first level and second level caches are allowed to
have unique data, inasmuch as data may exist in
either one or both levels of cache within the individual
processors, maintaining cache coherency requires
that both the first and second level caches
be searched each time an alternate processor request
shows up on the snooping bus. This causes
interference at the first level cache which may adversely
affect performance of the executing instruction
stream within the processor. Alternatively, it requires
an added port on the first level cache to allow
interrogation of the cache for snooping requests. In
any case, one is faced with pulling data from one or
both levels of cache in order to supply it to requests
from other processors. Additionally, coordination of
the snoop responses between the two levels of
cache is required within each processor.

Consider a second exemplary system including 
one or more entry-level processors - such that pri- 
vate first level caches exist for all processors in the 
multiprocessing system, and second level caches 
do not exist for all processors in the system. In an 
"allow unique" environment the first level cache al- 
ready has the mechanisms in place to allow snoop 
requests to be handled. 

2. FORCE INCLUSION: In a forced inclusion envi- 
ronment, the contents of the first level cache must 
always be a subset of the second level cache within 



EP 0 896 279 A1 



an individual processor. This allows a system to be 
built where the second level cache can control the 
snooping of external bus requests without interfer- 
ence in the first level cache unless data actually ex- 
ist there that are required. This is facilitated by a first 
level cache status array maintained at the second 
level cache controller which indicates presence, 
and possibly state, of the first level cache data. Note 
that presence is minimally required as the state of 
the first level data must be less than or equal to that 
of the encompassing second level cache block. Al- 
so, in this exemplary system, the L2 cache control- 
ler has primary responsibility for managing cache 
coherency. It also forms requests for data and other 
commands based on processor requests which are 
forwarded to the system bus. 
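
The filtering effect of forced inclusion can be sketched as follows. This is a minimal model under stated assumptions: the class and method names are hypothetical, and the L1 status array is reduced to a single presence bit per L2 line.

```python
class InclusiveL2Directory:
    """Hypothetical model: L2 tags plus a per-line 'present in L1' bit."""

    def __init__(self):
        # line address -> True if a copy is also held in L1
        self.lines = {}

    def install(self, line_addr, in_l1=False):
        self.lines[line_addr] = in_l1

    def bus_snoop(self, line_addr):
        """Return (l2_hit, forward_to_l1) for a snooped line address."""
        if line_addr not in self.lines:
            # Inclusion guarantees L1 cannot hold a line that L2 lacks,
            # so the first level cache is not disturbed.
            return False, False
        return True, self.lines[line_addr]
```

Only a snoop that hits a line marked present in L1 interferes with the first level cache.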

[0010] In a forced inclusion environment, for an entry-level
microprocessor where the second level cache is
not included, controlling cache coherency is
more difficult because the mechanisms for this control
reside in the L2 cache controller.
[0011] There is a requirement for a cooperative work 
environment in multiprocessor configurations including 
either low-end or high-end microprocessors. There is al- 
so a requirement to maintain data coherency in such a 
system. 

[0012] There is also a requirement to provide a microprocessor
design which is adaptable to either the low-end
or high-end configuration.

Summary of the Invention 

[0013] In accordance with a first aspect of the invention,
an apparatus and method are provided for operating
a computing system including an L1 cache. A second
level cache controller includes a first level cache snooping
path and an interface to a system bus. The controller
is operable in the absence of a second level cache for
reflecting snoop requests received at the system bus interface
and requests received at a processor interface
for a second level cache to the first level cache snooping
path.

[0014] In a second aspect of the invention, a cache 
controller for the management of contents of a single 
level of cache includes an interface to a processor, and 
a second level cache controller operable in the absence 
of a second level cache for reflecting processor snoop 
requests received at the processor interface to a first 
level cache snooping path. 
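
The reflection of snoop requests can be sketched in a few lines. This is a simplified model, not the controller's actual logic: the class, its attributes, and the use of Python sets for cache contents are all assumptions made for illustration.

```python
class L2CacheController:
    """Hypothetical model of the configurable snoop path."""

    def __init__(self, l1_lines, l2_lines=None):
        self.l1_lines = l1_lines   # set of line addresses held in L1
        self.l2_lines = l2_lines   # set of L2 line addresses, or None
        # Stands in for the configuration switch in the controller.
        self.l2_installed = l2_lines is not None

    def l1_snoop(self, line_addr):
        return line_addr in self.l1_lines

    def bus_snoop(self, line_addr):
        """Return the hit status reported to the system bus."""
        if self.l2_installed:
            return line_addr in self.l2_lines
        # L2 absent: reflect the request onto the L1 snooping path and
        # report the L1 status through the L2 status logic.
        return self.l1_snoop(line_addr)
```

From the bus side the two configurations look the same; only the source of the status differs.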

[0015] Other features and advantages of this inven- 
tion will become apparent from the following detailed de- 
scription of the presently preferred embodiment of the 
invention, taken in conjunction with the accompanying 
drawings. 



Brief Description of the Drawings 

[0016] Figure 1 is a block diagram illustrating a typical
microprocessor architecture within which a preferred
embodiment of the invention is implemented.

[0017] Figure 2 illustrates how Figures 2A through 2C
relate, while the latter are block diagrams showing the
implementation of a preferred embodiment of the invention
within the microprocessor of Figure 1.

[0018] Figures 3-6 are block diagrams illustrating the
system and L2 cache bus interfaces 101 and 103 of Figure 1,
with Figure 3 generally illustrating the system data
bus; Figure 4, the system bus controls; Figure 5, the L2
cache data bus; and Figure 6, the L2 cache controls.

[0019] Figure 7 is a block diagram illustrating those
portions of the L2 cache data bus controls of Figure 5
utilized in the entry-level microprocessor design of the
preferred embodiment of the invention. The unused portions
are shown hatched out.

[0020] Figure 8 is a block diagram illustrating those
portions of the L2 cache controls of Figure 6 utilized in
the entry-level microprocessor design of the preferred
embodiment of the invention. The unused portions are
shown hatched out.

[0021] Figure 9 is a state diagram illustrating the tag
state machine operated by the L2 cache controller of 
Figure 2C when responding to a request. 
[0022] Figure 10 is a flow diagram illustrating the four
stages implementing a pipelined L1 snoop operation,
and also the search operation of the invention.

[0023] Figures 11A and 11B are system diagrams illustrating
systems including processors with and without
an L2 cache in accordance with the invention.
[0024] Figure 12 is a timing diagram illustrating the
timing of control and status signals on the system bus
for explicit search (snoop) operations.
[0025] Figure 13 is a timing diagram illustrating the 
timing of control and status signals on the system bus 
for implied search operations. 


Detailed Description of Preferred Embodiments 

[0026] As used in the art, the term "snoop" may have
several meanings. For the purpose of the following description
of the invention, the term "bus snoop" refers to L1
and/or L2 cache directory access operations originating
from a system bus request: this will be at the L2 cache
and possibly at the L1 cache in L2 cache installed mode,
or at L1 cache in L2 not installed mode. The term "L1
snoop" refers to an access operation occurring at the L1
cache directory from either a system bus or processor
request. A "bus snoop" can result in an "L1 snoop". The
term "processor snoop" refers to a snoop of L1 cache
generated by a processor request which originates on
the interface to the L2 cache controller from either the
ATU 124 or DC 116; this request also accesses the L2
cache directory in systems with L2 installed. A "processor
snoop" can result in an "L1 snoop".
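
These definitions can be summarized as a small lookup of which cache directories a request touches. The function name, the string labels, and the `forwarded_to_l1` flag (standing in for "possibly at the L1 cache") are assumptions for illustration.

```python
def directories_accessed(kind, l2_installed, forwarded_to_l1=False):
    """Return the set of cache directories a snoop accesses.

    kind is "bus" or "processor"; forwarded_to_l1 only matters for a
    bus snoop in L2 installed mode.
    """
    if kind == "processor":
        # A processor snoop targets L1 and, when installed, the L2 directory.
        return {"L1", "L2"} if l2_installed else {"L1"}
    if kind == "bus":
        if not l2_installed:
            return {"L1"}
        return {"L1", "L2"} if forwarded_to_l1 else {"L2"}
    raise ValueError(f"unknown snoop kind: {kind!r}")
```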
[0027] Referring to Figure 1, the microprocessor architecture
within which a preferred embodiment of the
invention is implemented will be described.
[0028] Microprocessor chip 100 is organized to interface
system bus 102 and L2 cache 104, and includes
the following functional units: fixed point unit (FXU) 106,
floating point unit (FPU) 108, load store unit (LSU) 110,
instruction unit (IU) 112, instruction cache unit (ICU)
114, data cache unit (DCU) 116, L2 cache control unit
118, processor interface unit (PIU) 120, clock distribution
and control 122, and address translation unit (ATU)
124. In a multiprocessor environment, several processors
100 and their associated L2 caches 104 may interface
system bus 102 over buses equivalent to bus 101,
and share access through system bus 102 to main
memory (sometimes referred to as L3 memory) 126.
[0029] The various functional units of microprocessor
100 interface over data, address, and/or control I/O pins,
lines and/or busses as will be described hereafter. When
referring to a figure, "line" can refer to either a single
signal line or a collection of signal lines (i.e., a bus).
Those functional units most pertinent to the invention,
and which will be described in greater detail hereafter,
include the load/store unit (LSU) 110, the data cache
unit (DCU) 116, the L2 cache control unit (CCU) 118,
and the address translation unit (ATU) 124.
[0030] In broad overview, the functional units on chip
100 communicate as follows. Clock distribution and control
122 provides clocking signals to all functional units
on microprocessor chip 100. System bus 102 interfaces
to PIU 120 over bidirectional bus 101, and thence over
buses 105 with CCU 118. L2 cache 104 communicates
with CCU 118 over buses 103. CCU 118 communicates
instructions with ICU 114 over buses 109, with DCU 116
over buses 111, and provides address information to
ATU 124 and receives miss interface signals over buses
107. LSU 110 and IU 112 provide request interfaces to
ATU 124 and receive translation state information over
lines 129 and 131. ATU 124 provides translated addresses
to ICU 114 over lines 115, and to DCU 116 over lines
113. ICU 114 interfaces to instruction unit 112 over bus
119. DCU 116 provides data to FXU 106, FPU 108 and
LSU 110 over bus 121, and IU 112 provides instructions
to FXU 106, FPU 108 and LSU 110 over bus 123. LSU
110 provides data to DCU 116 over bus 125. FPU 108
exchanges data with DCU 116 over buses 127
to LSU 110, and then across buses 125. Processor 100 accesses
main memory through system bus 102.

Microprocessor Core 100 

[0031] Referring to Figures 2A through 2C, and Figures 3-6,
the core of microprocessor 100 will be described.
Figure 2A generally corresponds to load/store
unit (LSU) 110, Figure 2B to address translation unit
(ATU) 124, and Figure 2C to data cache unit (DCU) 116.
Figures 3-6 generally correspond to L2 cache control
unit (CCU) 118 and processor interface unit (PIU) 120.
[0032] Dispatch block 300 directs instructions from instruction
unit 112 to the DECODE stage buffers of the
various execution units 106, 108, 110, including on bus
301 (which is that portion of buses 123 directed to LSU
110) to LSU pipeline buffer 302.
[0033] The function of load/store unit 110 is to generate
effective addresses on 64 bit wide bus 313 for load
and store instructions and to serve as a source and sink
for GPR data. During writes to cache 400, registers 314
and 316 hold the data and address, respectively; the effective
address is on bus 313, and data select block 320
puts the data out on bus 323. During cache reads, data
from cache 400 comes in on line 461, is latched in register
330, and from there sent on line 333 to general purpose
registers 306 or to fixed point unit 106.
[0034] The output of LSU pipeline buffer 302 is fed on
line 303 to the LSU decode and address generation
block AGEN 304, which contains the general purpose
registers 306 and address generation adders (not
shown). The data output of decode block 304 is fed on
lines 311 to data register 314 and thence on line 319 to
data select block 320. The address output of AGEN 304
is fed on lines 313 to EXECUTE stage buffer 316, and
on bus 309 to real address MRU 430. AGEN 304 output
also includes control line 307, which it sets to indicate
either real or virtual mode addressing to data cache control
block 470 (also referred to as data cache controller
or first level cache controller).

[0035] The outputs of buffer 316 are fed on lines 317
to data select block 320 and to data cache address register
408, DIR address register 414 and register slot
MRU address register 406. The output of register 408
is fed on line 409 to multiplexer 412. Data select block
320 contains the data to be stored to data cache 400
from load store unit 110, and this is fed thereto on store
data output lines 323 via multiplexer 432, lines 433, align
block 460, lines 461, register 456, lines 457, and line
427 via multiplexer 426. Data select block 320 also provides
control signals to data cache controller 470 on
lines 321. The other inputs to multiplexer 432 are (1) L2
corrected data 609 via multiplexer 426 and line 427,
which is also fed to data cache 400, (2) bypass data to
DC on line 621, and (3) unaligned data (aka store merging
and correction) register 452 via lines 453 to line 427
via multiplexer 426. Multiplexer 432 output line 433 is
also fed via align block 460 and line 461 to register 456
and thence via multiplexer 424 to L2 cache controller on
line 425, along with the output of castout buffer 450 on
line 451. Align block 460 is, in this embodiment, a barrel
rotator or shifter which aligns D cache 400 data to quad
word boundaries on reads, and from multiplexer 432 to
quad word boundaries on stores.
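
The effect of a barrel rotator can be illustrated with a byte-wise rotation. This is a sketch only: byte granularity, a 16-byte quad word, and both function names are assumptions, since the description does not give the rotator's width or control encoding.

```python
def rotate_left(data: bytes, count: int) -> bytes:
    """Rotate a byte string left by count positions, wrapping around."""
    if not data:
        return data
    count %= len(data)
    return data[count:] + data[:count]


def align_to_quad_word(data: bytes, byte_offset: int, quad_word: int = 16) -> bytes:
    """Rotate so the byte at byte_offset moves to position 0.

    quad_word=16 bytes is an assumed size used to reduce the offset.
    """
    return rotate_left(data, byte_offset % quad_word)
```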
[0036] An effective address from instruction unit 112
on line 367 (a portion of lines 131) is latched in register
364 and fed on line 365 to ITLB 358 and to the compare
and address select block 356 at ISLB 354. Line 313 from
AGEN 304 is latched in register 384, and fed on line 385
to DTLB array 378 and compare and address select
block 374 at DSLB 376. In this preferred embodiment,
DTLB 378 may be a standard design, such as that described
by Liu, supra. Whereas the Liu TLB design is 32
bits wide, in this preferred embodiment a 64 bit wide TLB
378 is used.

[0037] Data select 320 output on line 325 is fed to
PUTAWAY stage buffer 330, which also receives data
on lines 461 from data cache 400 (via lines 401 and align
block 460) for LSU 110, and FPU 108 results on line 327
which is a portion of bus 127. The output of PUTAWAY
stage buffer 330 is fed on lines 333 to a floating point
register in FPU 108, special purpose registers 334
(among which are the timers), and general purpose registers
306. Special purpose registers 334 output line 335
is fed back to data select block 320 which allows the
processor to read them. Line 333 carries the data for
FPU 108 when doing a fetch from cache 400.
[0038] The selected output of instruction segment
look aside buffer (ISLB) 354 is fed on lines 355 to comparator
362, along with the virtual address output of
ITLB 358 on lines 359. ITLB real address output lines
361 are input to IC controls 350 (which includes instruction
directory 352) and status information on line 361 is
fed to ATU controls 370. The output of comparator 362
is fed on lines 363 to IC controls 350 and to ATU controls
370. The output of DSLB 376 is fed on lines 377 to comparator
382, along with the output of DTLB 378 on lines
379. The output of comparator 382 is fed on lines 383
to ATU controls 370 and DC control 470. DTLB 378 status
output 381 is fed to ATU controls 370 and DC control
470. ATU controls 370 outputs include control lines 369
to data cache controller 470, L2 address 371 and ATU
write data 373. IC controls 350 output is L2 address line
351. Real address from DTLB 378 is fed on lines 381 to
DC address register 408 and directory address register
414.

[0039] Predicted real address MRU 430 output signals
on line 431, representing the predicted real address
bits 50:51, are latched in registers 410 and 416.
The output of data cache address register 410 on line
411 is multiplexed with bits 50:51 of the output of register
408 in multiplexer 412, and its output is fed on address
lines 413 to data cache 400. The remaining bits of DC
address register 408 are fed straight through on line 413
to data cache 400. Similarly, the output of register 416
is fed on lines 417 to multiplexer 436, where it is multiplexed
with bits 50:51 of the output of register 414 on
line 415, and the result fed on lines 437 to directory array
440. The output of register 414 on line 415 is also fed
to address register 408.

[0040] The function of real address MRU 430 is to pro- 
vide predicted real address bits 50:51 to data cache 400 
and directory array 440. 
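
The multiplexing of predicted bits into the cache address can be sketched as a bit-field splice. The sketch assumes a 64-bit address numbered with bit 0 as the most significant bit, so that bits 50:51 sit just above a 12-bit byte offset; that numbering convention and the function name are assumptions, not statements from the description.

```python
ADDRESS_BITS = 64

# With bit 0 as the most significant bit (assumed), bit 51 is 12
# positions above the least significant bit.
SHIFT = ADDRESS_BITS - 1 - 51
MASK = 0b11 << SHIFT


def splice_predicted_bits(address: int, predicted: int) -> int:
    """Replace bits 50:51 of address with the two predicted bits."""
    return (address & ~MASK) | ((predicted & 0b11) << SHIFT)
```

All other address bits pass straight through, mirroring the description of register 408 and line 413.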

[0041] During the fetch stage, data cache 400 output
401 is fed to unaligned data register 452 and align block
460, and thence on line 461 to registers 456 and 330.
Line 401 contains the data to be read from data cache
400 by the load store unit 110, L1 snoop data to the L2
cache controller 118, merge data for partial stores to the
data cache 400, and castout data to castout buffer 450.
Slot MRU 402 output line 403 controls the selection of
one of four sets of data to load on bus 401 through a
multiplexer (not shown) on the output of data cache 400.
[0042] The output of castout buffer 450 is multiplexed
in multiplexer 424 with the output of register 452 on lines
453 and line 457 from DC putaway register 456, the output
appearing on lines 425 to the L2 cache controller.
The output of register 452 along with DC putaway register
456 and L2 corrected data on line 609 is also fed
to data cache input multiplexer 426, the output of which
appears on lines 427 to data cache 400 and multiplexer
432. The output of register 406 is fed on line 407 to slot
MRU 402. Slot MRU 402 output 403 is fed to data cache
400 where it controls a data multiplexer which selects
the appropriate cache set (as taught by Liu, supra).

[0043] Data cache (DC) control 470 receives inputs
from directory array 440 on lines 441 (signifying a directory
array hit or miss), from AGEN 304 on lines 307, data
select and execute cycle control block 320 on lines 321,
ATU controls 370 on lines 369, comparator 382 on lines
383, and controls 660 on search control line 481. Its outputs
are L2 address line 471, true slot hit or castout slot
line 849 (which is fed to control block 660 and is used
for maintaining the L1 status array in the L2 cache controller),
cache miss line 294 to processor 100, and status
response line 483, which also includes a cache hit/miss
indication for search operations, to L2 cache control
block 660.

[0044] The function of data cache control 470 is to
control the data flow multiplexing into and out of data
cache 400 and send results to the load/store unit 110,
address translation unit 124, and L2 cache control unit
118, and also to control writing of data into data cache
400.

[0045] Data directory 440 contains address tags to indicate
if the contents of the real address are present in
cache 400, and the status of the cache lines, whether
modified, shared, or invalid. It also contains an LRU
pointer for each congruence class, indicating which
cache 400 line should be replaced.
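
One congruence class of such a directory can be sketched as follows. This is a hypothetical model: four slots per class follows the "one of four sets" mentioned for data cache 400, while the recency list used to derive the LRU pointer, the class name, and the method names are assumptions.

```python
class CongruenceClass:
    """Hypothetical directory entry set: tags, states and an LRU pointer."""

    def __init__(self, ways=4):
        self.tags = [None] * ways
        self.states = ["I"] * ways        # "M", "S" or "I"
        self.order = list(range(ways))    # order[0] is least recently used

    @property
    def lru_slot(self):
        """Slot that should be replaced next."""
        return self.order[0]

    def _touch(self, slot):
        self.order.remove(slot)
        self.order.append(slot)

    def lookup(self, tag):
        """Return the hitting slot, or None on a directory miss."""
        for slot, (t, s) in enumerate(zip(self.tags, self.states)):
            if t == tag and s != "I":
                self._touch(slot)
                return slot
        return None

    def fill(self, tag, state="S"):
        """Place a new line in the LRU slot and return that slot."""
        slot = self.lru_slot
        self.tags[slot] = tag
        self.states[slot] = state
        self._touch(slot)
        return slot
```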

[0046] Address translation unit (ATU) control 370
handles translations from effective addresses to virtual
addresses to real addresses. It receives as inputs L2
corrected data on line 353, and provides TLB reload data
output on lines 375 to instruction translation lookaside
buffer (ITLB) 358 and data translation lookaside buffer
(DTLB) 378, ISLB 354, and DSLB 376. With respect to
look aside tables 354, 358, 376, 378, if a miss condition
is detected, ATU sequencer 370 requests data (address
and length) from L2 cache on bus 371 (Fig. 6). When L2
responds on bus 353 (Fig. 5), ATU examines the data
to select data for look aside buffer 378, 376, 354, 358,
as the case may be, or signals a translation exception
back to the instruction unit. ATU controls 370 tracks segment
and page table updates and sends them to L2 controls
on line 371. Line 381 provides the real address to
the data cache directory for comparison.
[0047] The effective address is compared in ISLB 354 
comparator 356 with the virtual address. If these match, 
then a valid effective to virtual address translation exists 
in buffer 354, which transmits the virtual address on line 
355 to compare block 362. 

[0048] ITLB 358 is accessed by an effective address
on line 365 from register 364 for doing virtual to real address
translation. The address input to ITLB 358 is a
portion of the effective address from IU 112 on lines 367.
Comparator 362 compares virtual addresses on lines
355 and 359, and signals the result on line 363. Associated
with each virtual address in ITLB array 358 is a
real address. The signal on line 363 indicates whether
or not the address on line 361 is valid.
[0049] DTLB 378 is accessed by an address from register
384. Comparator 382 compares data on lines 379
and 377, and signals the result on line 383. The signal
on line 383 indicates whether or not the address on line
379 is valid.
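
The two-stage lookup with a validating compare can be sketched as follows. Dictionaries stand in for the segment and translation lookaside buffers, and the 4 KB page size, 256 MB segment size, and all names are assumptions for illustration rather than values from the description.

```python
def translate(ea, slb, tlb, page_bits=12, seg_bits=28):
    """Translate an effective address to a real address, or None on a miss.

    slb maps effective segment id -> virtual segment id.
    tlb maps effective page number -> (virtual page number, real page number).
    """
    vsid = slb.get(ea >> seg_bits)
    if vsid is None:
        return None                      # segment lookaside miss
    page_in_seg = (ea >> page_bits) & ((1 << (seg_bits - page_bits)) - 1)
    vpn = (vsid << (seg_bits - page_bits)) | page_in_seg
    entry = tlb.get(ea >> page_bits)     # TLB is accessed by effective address
    if entry is None:
        return None                      # translation lookaside miss
    tlb_vpn, rpn = entry
    if tlb_vpn != vpn:
        return None                      # comparator: real address not valid
    return (rpn << page_bits) | (ea & ((1 << page_bits) - 1))
```

The final comparison plays the role described for comparators 362 and 382: the real address read from the TLB is used only when the two virtual addresses match.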

System Bus Interface 120 

[0050] Referring to Figures 3 through 6, the system
bus interface 120 and L2 cache control unit 118 of Figure
1 will be further described.

[0051] Correspondence between the high level block
diagram of Figure 1, and the more detailed illustration
of the preferred embodiment in Figures 3 to 6, is as follows.
Bus 101 of Figure 1 corresponds in Figures 3-6 to
system controls lines 559 at driver/receiver 556, system
address lines 569 at driver/receiver 564, system data hi
bus 513 at driver/receiver 512, and system data low bus
517 at driver/receiver 516. Bus 103 to L2 cache 104 of
Figure 1 corresponds to L2 cache address lines 691 out
of driver 690, L2 tag address line 693 out of driver 692,
L2 tag data lines 697 at driver/receiver 694, and L2
cache data bus 645 at driver/receiver 644. ICU bus 109
of Figure 1 corresponds (from ICU) to IC request lines
351, and (to ICU) DOIC register 606 output lines 607,
and bypass to IC multiplexer 616 on lines 617. DCU bus
111 of Figure 1 corresponds (from DCU) to DC request
lines 471 and data cache write data bus 425, and (to
DCU) to bypass to DC multiplexer 620 on lines 621 and
data cache data out (DODC) register 608 output line
609. Address translation unit (ATU) input/output bus 107
of Figure 1 corresponds to ATU request lines 371, ATU
write data bus 373, and multiplexer 612 output lines 353.
[0052] Referring to Figures 4 and 6, requests to L2
cache control 118 are latched in address/command register
650 from ATU request lines 371, IC request lines
351, DC request lines 471, and on lines 567 from address
in register 566, which latches system bus addresses
on lines 565 from receiver 564. These address/command
signals are latched as required in registers
650, 652 and 654 connected via lines 651 and 653. The
output of the third register 654 is fed to controls block
660 on line 655. The output of first stage register 650 is
fed on lines 651 to register 652, driver 690 to provide L2
cache address signal 691, driver 692 to provide L2 tag
address signal 695, ECC checking circuit 684, address
comparator 664, controls block 660, cache controller
(CC) snoop address register 670, processor address
registers CBPADR 674 and CBMADR 676, and address
multiplexer 680. ECC 684 output is fed on lines 685 to
driver 694 to provide L2 tag data on lines 697. CBPADR
address register 674 contains the address to the system
bus in the event of a cache miss, the output of which is
fed to multiplexer 680 on line 675. CBMADR address
register 676 contains the snoop address portion, and its
output is fed to multiplexer 680 on line 677. Receiver
694 output from L2 tag data lines 697 is fed on lines 695
to L2 tag in register (L2TAGIN) 688 and thence on lines
689 to error correction code (ECC) block 686. The output
of ECC block 686 is fed on lines 687 to comparator
664, address registers 670, 674 and 676.

[0053] The output of comparator 664 is fed on line 665 
to controls block 660. CCS address register 670 output 
line 671 is fed to multiplexer 678 along with lines 651 to 
generate the data cache L1 snoop address on lines 679. 

[0054] The output of address out multiplexer 680 is
fed on lines 681 to address out register 560, and thence
on line 561 to the system address bus 569 through driver
564. The output of controls block 660 is fed on lines 663
to arbitration and control block 552, and on lines 661 to
address/command register 658. Arbitration and control
block 552 receives control data from receiver 556 via
lines 557, and provides output on lines 555 to controls
block 660, and in the event of an L2 cache miss, request
out control signals are sent on line 553 through driver
556 to system controls bus 559. Another output of controls
block 660 appears on lines 661 to address/command
register 658, the output of which appears on line 659 to
multiplexer 672. Multiplexer 672 also receives input
from lines 653 and 655, and provides its output on lines
673 back to register 650.

[0055] Referring to Figure 5, ECC block 632, DOIC
register 606, DODC register 608, L2PDO register 636,
multiplexer 616 and multiplexer 620 each receive inputs
from data input register 624 on bus 625. The output of
ECC block 632 is fed on line 633 to L2 data out register
638, and thence to driver 644 on line 639. The output of
L2PDO register 636 is fed on line 637 to inpage buffer
646, the output of which is fed on line 647 to L2PDI register
642 and ECC circuit 632. The output of L2PDI register
642 is fed on line 643 to DOIC register 606, DODC
register 608, CCDI register 624, and to bypass multiplexers
620 and 616. The outputs of multiplexers 620 and
616 represent bypass data, and are fed on lines 621 and
617 to the DC and IC, respectively. Data cache write
data line 425 is fed to CMCD register 628 and CCDI register
624. The output of CMCD register 628 is fed on
lines 629 to L2PDO register 636, and castout buffers
602.
[0056] Referring to Figures 3 and 5, L2 cache data in
from bus 645 is received at receivers 644, fed on line
649 to L2 data in register 640 and thence on lines 641
to ECC circuitry 634 and bypass multiplexers 616 and
620. From ECC circuitry 634, L2 cache data in is fed on
lines 635 to cache controller data in register (CCDI) 624,
DOIC register 606 and DODC register 608. DODC register
608 output 609 is fed to data cache unit 116 (Figure
1), DC bypass multiplexer 620, ATU multiplexer 612,
and castout buffers 602. The output of DOIC register
606 is fed on lines 607 to instruction cache unit 114 (Figure
1), ATU multiplexer 612, and castout buffers 602.
Castout buffers 602 output on lines 603 is fed to data
high output register 502 and multiplexer 520, the output
of which is fed on lines 521 to data output registers 502
and 504.

[0057] In operation, registers 624 and 636 form a 
pipeline buffer to inpage buffer 646 and register 642. In- 
page buffer 646 caches a line from the system bus. Line 
641 from L2 data in register 640 to bypass multiplexers
616, 620 allows the saving of a cycle on cache misses 
when error correction is not required. DOIC register 606 
provides corrected data to instruction cache unit 114, 
and DODC provides corrected data to data cache unit 
116. Either register may supply data to the ATU 124. 
[0058] The normal path for routing L2 cache data is 
through register 640, ECC 634, and DOIC register 606 
and DODC register 608. 

Processor Interface Unit 120 

[0059] Referring now to Figure 3, a more detailed
description of processor interface unit 120 of Figure 1, and
associated circuitry, will be provided. Figure 3
represents the data flow portion of PIU 120 and System Bus
102.

[0060] System bus 102 data high bus 513 and data 
low bus 517 communicate through driver/receivers 512 
and 516, respectively with data high output register 502 
on lines 503, data high in register 506 on lines 515, data
low out register 504 on lines 505, and data low input
register 508 on lines 519. Each of busses 513, 517 is 
capable of handling eight bytes of data, providing a 16 
byte data bus. If the system is operating on only eight 
bytes, only one set of the input/output registers (such 
as 504, 508) is used. 

[0061] System data input registers 508 outputs on 
lines 507 and 509, respectively, are fed to multiplexer 
524 and thence, along with registers 506 on lines 507, 
on lines 525 to cache control data in (CCDI) register 624 
(Figure 5), which is the main data input register of the 
cache controller. Data input register 624 output is fed on 
bus 625 to multiplexer 520. 

Load/Store Unit (LSU) 110

[0062] Load/store unit (LSU) 110 functions to decode
fixed point and floating point loads and stores and cache
management operations, and to send effective addresses
and storage commands to the data cache unit (DCU)
116. LSU 110 also handles most move-to and move-from
special purpose register (SPR) 334 instructions. In
addition to functioning as a load/store unit, LSU 110 also
controls instruction execution sequencing after instructions
have been dispatched, through detection of most
instruction execution interlocks, and the generation of
resulting pipeline hold signals.

[0063] LSU 110 provides a six port register file 306,
made up of four 32x18 register array macros, arranged
as a 32x72 array with two write ports and four read ports.
This array implements the 64-bit general purpose
registers (GPRs) 306. GPR array 306 also provides operands
for fixed point unit (FXU) 106 decode stage (not
shown) as well as for LSU 110. FXU 106 decodes its
own instructions and generates requests to LSU 110 for
the necessary operands, as well as providing on line 327
a result operand and address when appropriate. LSU
110 accesses GPRs 306 for registers needed to generate
effective addresses (EA), and for data for store
instructions. Data operands received from data cache 116
on line 461, and updated effective addresses are written
back to the GPRs by LSU 110. Lines 327 contain the
FPU results, and are fed to register 330.

[0064] In handling floating point loads and stores,
LSU 110 generates effective addresses using operands
from GPR 306, and accesses the necessary floating
point register (FPR) operands from the floating point unit
(FPU) 108.

[0065] Instructions dispatched to LSU 110 are latched
in its DECODE cycle instruction register 302 at the end
of the I-fetch cycle. The basic LSU 110 pipe is three
stages: DECODE 302/304, EXECUTE 316/320, and
PUTAWAY 330. During the DECODE cycle corresponding
to 302/304, the instructions are decoded, and operands
are fetched from the GPR 306 array. Addressing
operands are gated to a 64-bit address generation
(AGEN) adder, and a 64-bit effective address is
calculated. The effective address (EA) is sent on lines 313 to
the address translation unit (ATU) 124 and to data cache
unit (DCU) 116 and latched at the end of the DECODE
cycle in pipeline buffer 316 which holds the effective
address during the EXECUTE cycle.

[0066] During the EXECUTE cycle, the operand for
store operations is passed to the DCU on line 323,
where it is aligned in block 460 and saved in register
456 for PUTAWAY in D-cache 400. At the end of the
EXECUTE cycle, if a load type instruction is being executed,
the data operand returns on line 461 to LSU 110 from
the DCU, and is saved in pipeline buffer 330 for
PUTAWAY.

[0067] During PUTAWAY cycle 330, as is represented
by lines 333, up to two 8-byte or one 16-byte operand
may be written to GPR 306. Floating point loads are
limited to one 8-byte operand per cycle. GPRs 306 are not
written until late in the PUTAWAY cycle 330. This
requires that operands being written to these arrays be
bypassed around the arrays to reduce pipeline
interlocks. Delaying the write to GPR 306 also allows sign
extension for algebraic load operations to be performed,
helping to balance path delays between EXECUTE
316/320 and PUTAWAY 330 cycles for these
instructions.

Fixed Point Unit (FXU) 106 

[0068] Fixed Point Unit (FXU) 106 executes the fixed
point instructions, not including storage access instruc- 
tions. FXU 106 includes a 64-bit adder, a 64-bit logical 
unit, a 64-bit rotate-merge unit, and a 64-bit carry save 
adder which supports two-bit-per-cycle product forma- 
tion during multiply instructions. 
[0069] During division, quotient formation occurs one 
bit per cycle, through repeated subtraction of the divisor 
from the shifted dividend. 

Floating Point Unit (FPU) 108 



[0070] Floating point unit (FPU) 108 executes the
floating point instructions, but not the storage access
instructions. In one exemplary embodiment, FPU 108
includes a 5-port 32x72-bit register array, a 32-bit
status-control register, a 3-bit overlap scan Booth encoder unit,
a 2-bit quotient generation unit, a 106-bit carry save
adder, a 106-bit increment-full adder, an operand
alignment shifter unit, a normalizer unit, and a rounder unit.

Address Translation Unit (ATU) 124 

[0071] Referring, primarily, to Figure 2B, address
translation unit (ATU) 124 translates the data effective
address (EA) from load/store unit (LSU) 110 and the
instruction effective address from instruction unit 112 into
real addresses used by the Data and Instruction Caches
to access their L1 Caches and used by the L2 Cache
Control Unit 118 to access the L2 Cache 104.
[0072] Microprocessor 100 implements segment
lookaside buffers (SLB) 354, 376 and translation lookaside
buffers (TLB) 358, 378, which function as caches
for segment and page table entries. When a required
entry is not found in a lookaside buffer, ATU 124 initiates
a fetch to L2 cache control 118 to access segment and
page table entries from memory 126 or L2 cache 104.
[0073] ATU 124 reports any translation data storage
interrupts (DSI) to the load/store unit 110 and any
translation instruction interrupts to the instruction unit 112.
Reference, change and tag change bits are all updated
by store requests to cache control 118 from ATU 124.
[0074] Microprocessor 100 provides a 4-entry SLB
354 for instruction address translation and an 8-entry
SLB 376 for data address translation. SLBs 354, 376
contain the most-recently translated segments in a fully
associative arrangement. The ESID (Effective Segment
ID) portion of the effective data or instruction address is
compared 356, 374 simultaneously to all entries in the
respective SLB 354, 376 ESIDs when segment
translation is enabled.

[0075] ATU 124 includes separate instruction and
data TLBs 358, 378, respectively, to hold the results of
virtual to real address translations. With virtual to real
translation active, the VSID from the matching SLB 354,
376 is compared in comparators 362, 382 against the
VSID stored in the TLB 358, 378. If a compare is found,
the Real Page Number (RPN) stored in the matching
TLB 358, 378 entry is used to form the real address.
Replacement is managed independently in each TLB
358, 378 by an LRU bit for each of the 256 pairs of
entries.

L1 Data Cache Unit (DCU) 116



[0076] In a preferred embodiment, L1 data cache unit 
(DCU) 116 has the following attributes: 64 KB size, 64 
byte line size, 4-way set-associative, 2 subline-modified 
bits per line, MRU slot selection, 40-bit real address, 
16-byte dataflow to/from processor, store-in design, and 
multi-processor support. The term "cache line" refers to
a 64-byte block of data in the cache which corresponds 
to a single cache directory entry. Slot MRU 402 provides 
selection of one of four sets of cache data during an ex- 
ecute cycle. Real address MRU 430 supplies bits 50:51 
to cache 400 and cache directory 440. Error correction 
ECC (not shown) is provided on cache 400 and cache 
directory 440. Write-thru mode is implemented. 
[0077] The data cache 116 array 400, representing a 
collection of sub arrays, is based on a 1024x78 1R1W 
"virtual" 2-port array macro. It provides for a read oper- 
ation followed by a write operation within a processor 
cycle. Read data remains valid on the array outputs until 
the next read operation begins even if there is an inter- 
vening write operation. Eight of these arrays are used 
to form a 64KB cache 400. Two arrays are used per slot 
to form a 16-byte dataflow in and out of the array, rep-
resented by lines 401. Data parity is stored in the array.
The last bit stored in the array is odd address parity
across bits 50:59 of the address used to access the data 
cache. 
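
The figures given above for data cache 116 can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant names are hypothetical, and the bit numbering (a 64-bit address with bit 63 as the least significant bit) is an assumption chosen because it reproduces the bits 50:59 range quoted for the address parity.

```python
# Figures taken from the description of data cache 116 and array 400.
CACHE_BYTES = 64 * 1024   # 64 KB data cache
LINE_BYTES = 64           # 64-byte cache line
WAYS = 4                  # 4-way set-associative (four slots)
ARRAY_ENTRIES = 1024      # depth of each 1024x78 array macro
DATAFLOW_BYTES = 16       # 16-byte dataflow per slot (two arrays per slot)

lines = CACHE_BYTES // LINE_BYTES                 # cache lines in total
sets = lines // WAYS                              # lines per slot
bytes_per_slot = ARRAY_ENTRIES * DATAFLOW_BYTES   # capacity of one slot
total_bytes = bytes_per_slot * WAYS               # capacity of all four slots

# Assumption (not stated in the text): 64-bit address, bit 0 is the most
# significant bit and bit 63 the least significant bit.
ADDRESS_BITS = 64
offset_bits = (DATAFLOW_BYTES - 1).bit_length()   # bits selecting a byte in 16 bytes
index_bits = (ARRAY_ENTRIES - 1).bit_length()     # bits selecting one of 1024 entries
index_low = ADDRESS_BITS - offset_bits - 1        # least significant index bit
index_high = index_low - index_bits + 1           # most significant index bit

print(lines, sets, total_bytes, (index_high, index_low))
```

Under these assumptions the four slots of 1024 sixteen-byte entries account for the full 64 KB, and the ten array index bits fall on address bits 50:59.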

[0078] Two arrays are needed to implement data 
cache directory 440. The directory implements a 28-bit 
real page number (RPN) along with five ECC check bits. 
A valid and two subline modified status bits are main- 
tained, and three check bits are stored with them. The 
RPN and status fields are replicated four times to rep- 
resent the four sets accessed at a particular directory 
array address. A 3-bit LRU is shared between two di- 
rectory arrays to indicate the least recently used slot. 
[0079] Slot MRU 402 logically appears as a 1024x4 
array where each entry is associated with a cache line 
in data cache 400. Bits 48:51 of the 48:57 used to ac- 
cess the logical array 400 are effective address bits. 
MRU 402 bits are updated whenever an incorrect slot 
guess or a cache miss occurs. 
[0080] Real address (RA) MRU 430 is used to generate
a prediction of real address bits 50 and 51 for
addressing both cache 400 and cache directories 440. As
is represented by line 309, array 430 is read as part of
the AGEN stage 304 of the pipeline. If a load/store unit
pipeline EXECUTE stage or latched PUTAWAY stage
hold is present, then the array 430 output is not used.
Real mode is also used to determine if the array 430
(Figure 2C) output is used. Real mode determination
occurs in AGEN 304 which sets control line 307 to either
the real or virtual mode addressing. If real mode is
determined, then load/store effective address (LSEA) 317
bits 50:51 are used by register 408 to access cache 400
and register 414 to access cache directory 440 instead
of RA MRU array 430 output.

[0081] Real address (RA) MRU array 430 is updated
from DC address register 408 via lines 281 whenever a
wrong prediction occurs with respect to translated
address bits 50:51. Also, data cache address register 408
and data cache directory address register 414 are
updated with proper values of address bits 50:51 via line
381 for reaccessing the cache 400 and cache directory
440 arrays. Multiplexer 412 is then switched under
control of data cache control block 470 so that address
register 408 is used to access cache array 400. A similar
function occurs with multiplexer 436 so that register 414
is used to access the directory array 440. The LSU 110
pipeline is stalled for one cycle to allow cache 400 and
directory 440 to be reaccessed in parallel in the same
cycle. Data is then returned to LSU 110 via line 461 in
the following cycle.

Instruction Cache Unit (ICU) 114 

[0082] Instruction Cache Unit (ICU) 114 contains the
physical arrays, address compares, and error checking
circuitry to provide a 64KB 4-way associative instruction
cache with single-bit error detection and recovery. The
single-cycle cache access provides up to four
instructions from a selected 128-byte cache line. Instruction
cache unit 114 provides instructions to other functional
units, including branch prediction.

L2 Cache Control Unit 118 

[0083] The functions of the L2 cache control unit 118
are to provide processor 100 with access to a private L2
cache 104, plus access to memory 126 through system
bus 102 which also supports memory coherence control
for multiprocessor operations. L2 cache 104 is
implemented as external static RAMs, with one set of SRAMs
for the directory and another set for the data.
[0084] CCU 118 accepts commands from four sources:
data cache unit 116, instruction cache unit 114,
address translation unit 124, and system bus 102 via
Processor Interface Unit (PIU) 120. To handle these
commands, CCU 118 uses the buffer structure shown in
Figure 6. External and internal commands are prioritized
by CCU controls 660 and placed into ADR/CMD buffer
650. ADR/CMD buffer 650 output 651 is then used to
access an L2 directory (not shown) via interface lines
693 driven by driver circuits 692 to determine the
hit/miss status. Additionally, appropriate address bits from
bus 651 are concurrently used to access an L1 status
array (not shown) in controls 660 to determine if a data
cache L1 snoop needs to be done. Finally, ADR/CMD
buffer 650 is used to control updating status and tag
information in the L2 directory as required, a process well
understood in the art. The four L2 hit/miss states are:

1) Modified

This line is different from memory and no other 
coherent cache has a copy of this line. 

2) Exclusive 

This line is the same as memory and no other 
coherent cache has a copy of this line. 

3) Shared 

This line is the same as memory and other 
caches may have a copy of this line. 

4) Invalid 

This cache and this processor's data cache do 
not have a copy of this line. 

[0085] Data can be in the data cache only if it is also 
in the L2 cache. 
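
The four directory states and the inclusion rule just stated can be illustrated with a short sketch. It is not taken from the described hardware: the `LineState` enumeration, the helper names and the dictionary used as a stand-in for the L2 directory are all hypothetical.

```python
from enum import Enum


class LineState(Enum):
    """The four L2 hit/miss states listed above."""
    MODIFIED = "Modified"
    EXCLUSIVE = "Exclusive"
    SHARED = "Shared"
    INVALID = "Invalid"


def differs_from_memory(state):
    # Only a Modified line differs from memory.
    return state is LineState.MODIFIED


def other_caches_may_hold_copy(state):
    # Only a Shared line may also be present in other coherent caches.
    return state is LineState.SHARED


def inclusion_holds(l1_addresses, l2_directory):
    """Check that every line held in the data cache is also valid in L2.

    l1_addresses: iterable of line addresses present in the data cache.
    l2_directory: dict mapping line address to LineState (hypothetical model).
    """
    return all(
        l2_directory.get(addr, LineState.INVALID) is not LineState.INVALID
        for addr in l1_addresses
    )


directory = {0x1000: LineState.MODIFIED, 0x2000: LineState.SHARED}
print(inclusion_holds([0x1000], directory))   # line 0x1000 is valid in L2
print(inclusion_holds([0x3000], directory))   # line 0x3000 is absent from L2
```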

[0086] Commands only stay in ADR/CMD buffer 650 
for three cycles, at which time the command moves to 
ADR/CMD buffer 652 or ADR/CMD buffer 658. A proc- 
essor command is moved into the ADR/CMD buffer 652 
when said command is in ADR/CMD buffer 650 and the 
resources it needs, such as the data flow, are not avail- 
able. The command will stay in ADR/CMD buffer 652 
until the resource becomes available. 
[0087] Commands are moved to the ADR/CMD buffer 
658 from ADR/CMD buffer 650 by way of controls block 
660 when a system bus snoop command needs to use 
the data path. The command will stay in ADR/CMD buff- 
er 658 until the data path is available. Commands that 
need to issue address commands on the system bus 
are placed in ADR/CMD buffer 654. The command will 
stay in ADR/CMD buffer 654, being retried if necessary, 
until a successful address status and response is re- 
ceived from system bus 102. If data movement is re- 
quired the command is then turned over to the CCU data 
flow logic. 
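
The movement of commands out of ADR/CMD buffer 650 described above may be summarized as a simple routing decision. The sketch below is a simplification under stated assumptions: the `Command` record and its field names are hypothetical, and the order in which the three conditions are tested is an assumption, since the text does not state their relative priority.

```python
from dataclasses import dataclass


@dataclass
class Command:
    source: str                      # "processor" or "bus_snoop" (hypothetical labels)
    needs_data_path: bool = False    # bus snoop that must use the data path
    needs_bus_address: bool = False  # must issue an address command on system bus 102


def next_buffer(cmd, resources_available):
    """Return the number of the ADR/CMD buffer a command moves to from 650.

    Returns None when none of the holding conditions described applies.
    """
    if cmd.source == "bus_snoop" and cmd.needs_data_path:
        return 658   # waits there until the data path is available
    if cmd.source == "processor" and not resources_available:
        return 652   # waits there until the needed resource is available
    if cmd.needs_bus_address:
        return 654   # stays, retried if necessary, until the bus responds
    return None


print(next_buffer(Command("processor"), resources_available=False))
```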

[0088] Feedback from ADR/CMD buffer 658 to ADR/ 
CMD buffer 650 is necessary for two separate functional 
operations. The first feedback case is for processor read 
commands that encountered a shared address re- 
sponse from system bus 102. When the processor read 
command is first in the ADR/CMD buffer 650 the L2 di- 
rectory is marked Exclusive, assuming that this L2 will 
have the only copy of the data. If another device
indicates that it also has a copy of this data, by a shared
address response on system bus 102, then the L2
directory must be changed from Exclusive to Shared.
[0089] The second feedback operation is used for
processor write operations that must wait for a successful
system bus 102 address status and response before
the data can be written. For processor stores or
data-cache-block-zero (dcbz) instructions that hit shared in
the L2 directory, the processor must make sure that it
holds the line in the exclusive state before it updates the
data. Before the processor can get ownership of the
shared line it may lose the line to another device, so the
feedback path is provided to reinitiate the directory
access.
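
The two feedback cases can be pictured in software terms as follows. This is a loose sketch, not the controller logic: the function names, the string labels, the attempt limit and the handling of a line that has been lost (treated here as an ordinary miss) are all assumptions made for illustration.

```python
def directory_state_after_read(bus_response):
    """First feedback case: a processor read that missed in L2.

    The directory is marked Exclusive on the first pass; a shared address
    response from the bus is fed back and changes the entry to Shared.
    """
    state = "Exclusive"            # optimistic marking on the first pass
    if bus_response == "shared":   # another device also holds the line
        state = "Shared"
    return state


def store_to_line(read_directory, request_ownership, max_attempts=4):
    """Second feedback case: a store or dcbz that hits a shared line.

    read_directory and request_ownership are hypothetical callables standing
    in for the directory access and the system bus address transaction.
    """
    for _ in range(max_attempts):
        state = read_directory()          # directory access, reinitiated via feedback
        if state in ("Exclusive", "Modified"):
            return "write"                # line is owned, data may be updated
        if state == "Invalid":
            return "miss"                 # assumption: a lost line is handled as a miss
        request_ownership()               # Shared: ask the bus for ownership
    return "retry_exhausted"


states = iter(["Shared", "Exclusive"])
print(store_to_line(lambda: next(states), lambda: None))
```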



L2 Cache Control Unit Snoop Operations

[0090] Bus snoop commands from system bus 102
come in through processor interface unit 120 and are
presented to ADR/CMD buffer 650 via bus 567. At the
same time a shift register (not shown) is started. The
shift register is used to 'time out' the bus snoop
command. Bus snoop commands require a response within
a fixed time, but the command may be delayed before
being brought into ADR/CMD buffer 650 because of
other higher priority commands. If the shift register 'times
out', an address retry response will be issued to the
system bus 102.

[0091] When a bus snoop command is accepted into
ADR/CMD buffer 650 the L2 directory and L1 status
array are checked. If the command hits in the L2 directory
and the L1 status array, then a L1 snoop command is
issued to the data cache. If data must be moved to
complete the bus snoop command, it will be first moved out
of the L2 cache into the castout buffer 602. Then if the
data cache has a modified copy of the data, its copy of
the data will be moved to the castout buffer 602 and
subsequently via bus 603 to system bus 102.

[0092] The memory management policy is such that
segment and page translation table entries may not be
accessed directly from the L1 data cache by the ATU
124. Consequently, another type of snoop operation, a
processor snoop, is done for ATU commands. When an
ATU command comes in, the data cache is snooped
using the L1 status array. If the data cache has modified
data, the ATU command is stopped until the data is
moved from the data cache to the L2 data RAMs.

Processor Interface Unit (PIU) /Bus Interface Unit (BIU)
120

[0093] Referring to Figures 1 and 3, processor
interface unit (PIU) 120 controls and monitors all
communications with the main system bus 102. The main
functions of PIU 120 are:

1) Transport commands, address, and data
between CCU 118 and system bus 102.

2) Prune out incoming command-address transfers
that do not require the attention of CCU 118.

3) Compensate for clock domain differences
between the processor 100 units and 6xx Bus 102.

4) Maintain and monitor system checkstop logic for
Processor Run-Time Diagnostics (PRD).

[0094] System bus interface, or processor interface 
unit (PIU) 120, in general, receives commands from L2 
cache controller (CCU) 118 on lines 663, transforms 
them in block 552 to the system bus clock domain and 
presents them on lines 559 to bus 102. It then monitors 
status and response information received on lines 559 
for the command and informs CCU 118 on lines 555. As
commands arrive from the bus on lines 559, PIU 120 
categorizes them into one of three categories: master 
operations, bus snoop operations and other operations. 
Master operations are those originated by CCU 118 on 
the same chip 100 as PIU 120. These operations need 
to be monitored for status and response, updating CCU
118 as this information arrives. Bus snoop operations 
are those that are originated by other bus units and re- 
quire the attention of CCU 118. PIU 120 passes these 
operations on the bus snoop path to CCU 118 indicating 
a bus snoop and continues to monitor status and re- 
sponse. Other operations are those originated by other 
units that do not require the attention of the CCU 118. 
For these operations, PIU 120 only monitors status and 
response without informing CCU 118. 
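
The three-way categorization performed by PIU 120 can be sketched as below. The enumeration, the two boolean inputs and the returned action flags are hypothetical names introduced only to restate the behavior described in the preceding paragraph.

```python
from enum import Enum


class Category(Enum):
    MASTER = "master operation"
    BUS_SNOOP = "bus snoop operation"
    OTHER = "other operation"


def categorize(originated_by_local_ccu, needs_ccu_attention):
    """Classify a command arriving from the system bus."""
    if originated_by_local_ccu:
        return Category.MASTER      # issued by CCU 118 on the same chip
    if needs_ccu_attention:
        return Category.BUS_SNOOP   # from another bus unit, CCU must look at it
    return Category.OTHER           # from another bus unit, CCU not involved


def piu_actions(category):
    """Return (monitor_status, inform_ccu) for a category."""
    monitor_status = True                        # status and response are always monitored
    inform_ccu = category is not Category.OTHER  # OTHER is pruned before reaching the CCU
    return monitor_status, inform_ccu


print(categorize(False, True), piu_actions(Category.OTHER))
```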

Clock Distribution and Control 122

[0095] Clock distribution and control 122 contains the
logic for gating, shaping, and distributing the internal 
clocks as well as the off chip cache and directory clocks. 
[0096] During normal system operation, all clocks are
derived from and synchronized to a single oscillator
input by a phase locked loop (PLL) circuit which provides
a 'zero delay' clock tree relative to the input oscillator
and also a frequency multiplier function. Microprocessor
100 uses this function to run the internal processor logic
at a faster rate than the system bus 102 interface logic,
which runs at the same rate as the oscillator input. A
second on-chip 100 PLL is employed to derive the
clocks for the off-chip L2 cache 104. This PLL uses the
frequency multiplied output of the first PLL as its input.
An off-chip feedback path, constructed to match the
path to the cache chips, results in a low skew delay
relative to the processor clock domain and allows for
synchronous communication between processor 100 and
cache 104.
Part 2

Coherency Mechanism for Configurable Caches 

[0097] Referring to Figures 11A and 11B, various
possible configurations of processors 100 are illustrated
interfacing system bus 102. Processors 100...100A have
L2 cache data and tag SRAMs 104...104A attached, and
processor 100B does not. In accordance with this
invention, a second level cache controller is used in processor
100B in the same manner as if the second level L2
cache existed, such as in the case of L2 cache 104,
104A at processors 100, 100A, and the first level cache
status is fed into the second level cache controls, as will
be explained hereafter.
[0098] In accordance with the invention, the base
microprocessor is not changed to accommodate a design
without an L2 cache tag and data arrays. In slightly more
detail, second level cache control logic continues to
exist within the microprocessor chip. Rather, only the
external second level cache tag and data arrays are
removed and their inputs to the microprocessor tied to an
inactive state. A configuration switch (which feeds
controls 470 and 660) is set in second level cache controller
660 that causes bus snoop requests to get reflected
onto a first level cache snooping path (the L1 snoop). The
first level cache status is then fed back to the second
level cache controller, in a manner consistent with the
timing required for support of a second level cache
directory search, and fed into the second level cache
status signal generation logic, effectively making the
second level cache controller believe that the second level
cache still exists for the snooping control logic. All other
actions remain the same in the second level cache
controller and an effective and simple method for supporting
snooping bus protocols in the microprocessor has been
created. A result is that now every bus request snoops
the first level cache without knowledge of presence of
an L2 cache. This environment supports entry level
single processor configurations where the bus snooping
requests only amount to input/output traffic. A similar
mechanism is employed for certain requests received
from the processor which require cache access.
Specifically, these requests can originate from data cache 116
or address translation unit 124. The L1 snooping path
is used to determine L1 cache status which is
substituted for L2 cache states.
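
The effect of the configuration switch can be illustrated with a small software analogy. This is only a sketch: the function and parameter names are hypothetical, and the two lookups are modeled as plain dictionaries rather than as the directory and snoop path hardware.

```python
def snoop_status(address, l2_installed, l2_directory, l1_directory):
    """Return the cache status the snooping control logic acts upon.

    l2_installed models the configuration switch. When it is False, the
    status obtained over the L1 snooping path is substituted for the status
    that an L2 directory search would otherwise have produced.
    """
    if l2_installed:
        return l2_directory.get(address, "Invalid")
    return l1_directory.get(address, "Invalid")


# Hypothetical contents for a processor such as 100B, with no L2 SRAMs fitted.
l1 = {0x4000: "Modified"}
l2 = {}

print(snoop_status(0x4000, l2_installed=False, l2_directory=l2, l1_directory=l1))
print(snoop_status(0x8000, l2_installed=False, l2_directory=l2, l1_directory=l1))
```

The caller is the same in both configurations, which mirrors the point made above that all other actions of the second level cache controller remain unchanged.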

[0099] Referring to Figure 7 in connection with Figure
6, in a system configured as L2 not installed, compare
664, ECC 684 and 686, L2TAGIN register 688 and D/R
694 are not used, and inactivation of these elements is
illustrated with hatching. Also, referring to Figure 8 in
connection with Figure 5, ECC 632 and 634, L2DATAO
register 638, L2DATAI register 640, D/R 644, L2PDO
register 636, inpage buffer 646 and L2PDI register 642
are not used in L2 not installed mode, and inactivation
of these elements is illustrated with hatching. These
blocks represent data and control flow to and from the
L2 tag and data external SRAMs 104, 104A. These
external SRAMs are not part of the system in the L2 not
installed configuration of processor 100B.
[0100] Referring to Figures 2C and 7, lines 481, 483
to/from data cache control 470 connect into controls 
660. These lines are search control 481 and status re- 
sponse 483 and are used to replace L2 status reported 
from compare 664 when L2 cache is not installed in the 
system. 

[0101] Referring to Figure 7 in connection with Figure
6, DC snoop address generation is fed by lines 679. 
Lines 679 are generated by multiplexer 678 which is fed 
by lines 651 and 671. Lines 651 represent a new path 
used to send a data cache search request address to 
data cache 116. 

[0102] A search operation is one in which the data 
cache directory is accessed and results are reported for 
the purpose of determining the status of a cache line in 
the data cache when the L2 cache is not installed in the 
system. 

[0103] Referring to Figure 10, search operations 575
utilize much of the existing address and control flow 
which already exists for performing data cache 400 L1 
snoop operations with some minor modifications. 
Searches only report status back to the L2 cache con- 
troller 118, replacing the status that would otherwise be 
generated by accessing the L2 tag array in a system 
configuration with L2 installed. 
[0104] Referring to Figure 10, with L2 installed, pipe- 
lined L1 snooping comprises four overlapped stages, as 
follows: REQUEST 571, SNOOP 572, ACCESS 573 
and FLUSH 574. The various registers, arrays and con- 
trols comprising these stages have been previously de- 
scribed in connection with Figures 2C, 2D and 6, and 
are separate pipeline stages from those described with 
respect to the load/store unit 110, Figure 2A.
[0105] During REQUEST 571, a directory access L1
snoop request is pending to the L1 cache directory 440. 
If directory address register 414 is available as deter- 
mined by DC control 470, then the L1 snoop address 
will be latched into register 414 from cache controller 
snoop (CCS) address register 670 (Figure 6) on DC 
snoop address line 679. 

[0106] During SNOOP 572, cache directory 440 is ac-
cessed and the result of the L1 snoop is latched in DC
control 470. At the same time, data cache read address 
register 408 is latching the address for the access stage 
of the L1 snoop pipeline from line 415 for access in the 
following cycle. 

[0107] During ACCESS 573, cache arrays 400 are ac- 
cessed while results from the SNOOP stage are proc- 
essed. The data read out of cache array 400 during the 
access stage are latched in register 452. 
[0108] During FLUSH 574, the cache data latched in 
register 452 during the ACCESS stage is sent to L2 CCU 
over DC write data line 425. 

[0109] When data cache 400 is snooped as previously
described, any data transfers resulting from the L1
snoop are sent to the L2 CCU 118. L2 CCU 118 in turn
transfers this data to either L2 cache 104 or system bus
102, depending on the originator of the L1 snoop
request. For instance, a L1 snoop resulting from an ATU
124 request will cause the data being transferred by the
snoop operation to be placed into L2 cache 104. If, on
the other hand, the L1 snoop is resulting from a system
bus operation the data will be transferred out to system
bus 102. L1 snoop data from data cache L1 400 may be
merged with L2 cache 104 data so that an entire 128
bytes corresponding to an L2 cache 104 line size will be
transferred out to the system bus 102. When placed on
system bus 102, the data will then either be directed to
main memory 126 or another processor 100B,
depending on the operation.

[0110] If multiple data transfers are required out of L1
cache 400 for a given cache line, then a pipeline hold is 
sent to the REQUEST 571 and SNOOP 572 stages of 
the pipeline and the ACCESS 573 stage is then repeat- 
ed. 

[0111] Referring further to Figure 10, with L2 cache 
not installed, the mechanism for doing an explicit search 
request in data cache 400 is similar to that for a L1 
snoop. The main differences are that no data transfers 
ever directly result from a search and that only one 
search request is done at a time. The REQUEST 571 
and SNOOP 572 stages of a pipelined L1 snoop oper- 
ation are essentially reused for performing a search 575. 
The results of a search 575 are reported to L2 cache 
controller 118 in the cycle that would correspond to an 
access 573. That is, in the case of a search 575 the data 
cache 400 is not accessed but the results of the snoop 
stage 572 are still processed and then sent to L2 cache 
controller 118. 
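
The relationship between a pipelined L1 snoop and a search can be laid out cycle by cycle. The sketch below is illustrative only: the operation labels and the stage name `REPORT` are hypothetical, the latter standing for the cycle in which search results are sent to L2 cache controller 118.

```python
L1_SNOOP_STAGES = ("REQUEST", "SNOOP", "ACCESS", "FLUSH")
SEARCH_STAGES = ("REQUEST", "SNOOP", "REPORT")   # REPORT replaces ACCESS, no FLUSH


def timeline(operation, start_cycle=0):
    """Map cycle number to stage for one operation entering at start_cycle."""
    if operation == "l1_snoop":
        stages = L1_SNOOP_STAGES
    elif operation == "search":
        stages = SEARCH_STAGES
    else:
        raise ValueError(f"unknown operation: {operation}")
    return {start_cycle + i: stage for i, stage in enumerate(stages)}


def transfers_data(operation):
    # Per the description, no data transfer ever directly results from a search.
    return operation == "l1_snoop"


print(timeline("l1_snoop"))
print(timeline("search"))
```

Comparing the two timelines shows the search reporting its result in the same cycle position that the ACCESS stage occupies in a snoop.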

[0112] Referring to Figures 12 and 13, there are two 
types of search requests: 

1) Explicit searches, where the status in the data
cache directory 440 is interrogated by sending a DC
snoop address from L2 cache controller 118 to data
cache 400.

2) Implied searches, where the status of the data
cache directory 440 is not directly interrogated, but
the search status response lines 483 are checked
by L2 cache controller 118. The data cache
controller 470 is responsible for ensuring that these lines
483 indicate the appropriate state for implied
searches.

[0113] As an example, an ATU 124 request to fetch a
page table for address translation results in an explicit
search 575 of the data cache 400 in L2 not installed
configurations. The page table may be resident in a line
stored in data cache 400. If a search results in a data
cache miss, the operation proceeds out to main memory
126 to fetch the page tables. If the search results in a
data cache hit, a separate L1 snoop operation 579
occurs to data cache 400 to flush out any modified data.
L1 snoop operation 579 includes request 571, snoop
572, access 573 and flush 574. The L1 snoop operation
579 resulting from the search 575 is the same
mechanism that is used in a system with L2 cache. If the page
table is resident in data cache 400, a system with L2
cache would have determined an L2 hit from accessing
the L2 tag array. A L1 snoop operation would then be
sent to the data cache.
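
The ATU example may be restated as a short decision sequence. This is a sketch under stated assumptions: the function name and the step strings are hypothetical, and only the steps named in the paragraph above are modeled.

```python
def atu_page_table_fetch(search_hit):
    """List the steps of an ATU page table fetch with L2 not installed.

    search_hit is the result of explicit search 575 in data cache 400.
    """
    steps = ["explicit search 575 of data cache 400"]
    if search_hit:
        # A hit triggers a separate L1 snoop 579 to flush any modified data.
        steps.append("L1 snoop 579: request 571, snoop 572, access 573, flush 574")
    else:
        # A miss sends the operation out to main memory.
        steps.append("fetch page table from main memory 126")
    return steps


for step in atu_page_table_fetch(search_hit=False):
    print(step)
```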

[0114] For either an implied search or an explicit
search, when the L2 cache controller 118 takes a DC or
ATU request, it initiates operation of a tag state machine
708 (Figure 9) as if an L2 tag access is occurring. When
the tag state machine reaches a point where L2 status
would otherwise be available, data cache search status
is queried instead.

[0115] An example of an implied search is a DCU 116
request for data due to a data cache 400 miss. In this
case, the data cache 400 status is already known and
another directory access via a search operation 575 is
not necessary. Data cache 400 indicates miss status on
the status response line 483 in the appropriate cycle. L2
cache controller 118 then initiates an operation to
access main memory 126 for the requested data in a
manner similar to a system with an L2 cache that has an L2
cache miss.

[0116] Another case of an implied search is a DCU
116 request for exclusive access to a cache line, otherwise
known as a Dclaim request. This request is initiated
when data cache 400 currently has a cache line in the
shared state and has a store pending to that line. The
L2 cache controller 118 starts the tag state machine and
checks data cache search status as in the previous
case. In this case, however, the data cache 400 reports
shared status on the search status lines 483. The L2
cache controller 118 then initiates a Dclaim on the system
bus in a manner similar to a system with L2 cache
that has the line in the shared state.
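
The two implied-search cases above amount to a small decision table. The Python sketch below is a hedged illustration, not the controller logic itself; the function name `action_for_dcu_request` and the string labels are hypothetical, and the `"unspecified"` result marks combinations the text does not describe.

```python
def action_for_dcu_request(request, l1_status):
    """Map a DCU request and the sampled L1 status to a follow-up action.

    Only the two cases described in the text are modelled; every other
    combination is reported as "unspecified" rather than guessed.
    """
    if request == "load_miss" and l1_status == "miss":
        # Same follow-up as an L2 miss in a system with L2 installed.
        return "fetch_from_main_memory"
    if request == "dclaim" and l1_status == "shared":
        # Same follow-up as an L2 holding the line in the shared state.
        return "issue_dclaim_on_system_bus"
    return "unspecified"
```

In both modelled cases the action matches what a system with an L2 cache would do for the corresponding L2 status, which is why the rest of the controller needs no changes.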
[0117] Another case of an explicit search is a bus
snoop request. The bus snoop request originates from
a system bus command latched up in processor interface
unit (PIU) 120. This bus snoop is then forwarded
on to L2 cache controller 118. The L2 cache controller
118 starts its tag state machine and sends a search
request, and as previously described, L1 status lines 483
are substituted for L2 status lines 665. This status 483 is
then forwarded to the system bus 102 interface from
control block 660 to PIU 120 and thence to system bus
102. Based on the status and the type of bus snoop, a
separate L1 snoop operation 579 may then be initiated.
Data may need to be flushed from the L1 cache, in which
case it is transferred to the L2 cache data flow along the
path 425 to CCDI 624 and then on store data lines 625
to data registers 502 and 504. These data are then
transferred to system bus 102 on lines 513 and 517
through D/R blocks 512 and 516.
[0118] Referring to Figures 12 and 13, in all of the
search examples given above, an access of the L2 tag
array is replaced with either an explicit or implicit search
operation. If data needs to be transferred out of data
cache 400 due to the results of a search, then a separate
L1 snoop operation 579 is sent to the data cache 400.
[0119] Referring to Figure 9 in connection with Figure
6 and the timing charts of Figures 12 and 13, a state
diagram for the tag state machine 708, which is a portion
of controls 660, is set forth. As is represented by arrow
703, tag state machine 708 watches in tag access cycle
1, tacc1[0] state 702, for a command from address/command
register 650 on lines 651. In systems where L2
cache is installed, controls 660 tag access cycle 1, state
tacc1[0] 702, results in an address being sent on lines
693 to the L2 tag array (also referred to as the L2 cache
directory) within cache 104 based on the command
received on lines 651. Controls 660 tag access cycle 2,
state tacc2[1] 704, represents the timing cycle for off-chip
pipelined SRAMs that implement the tag arrays;
tag data is latched in L2 cache controller 118 in L2 tag
in register 688 at the end of this state 704. In controls
660 tag compare cycle, state tcmpr[2] 706, L2 tag data
is available for determining an L2 cache hit.
[0120] In a system with L2 cache not installed, tacc1[0]
702 responds by activating search control line 481,
and a search address is also sent from lines 651 through
multiplexer 678 to lines 679 and then to L1 directory
address register 414. The L1 cache directory is then
accessed in the cycle where tacc2[1] 704 is active. L1
cache directory 440 status 483 is substituted in state
tcmpr[2] 706 for L2 directory hit status.
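
The three-state sequence and the status substitution can be sketched in software. The Python model below is an illustration under stated assumptions, not the hardware implementation: the function `run_tag_state_machine`, its callback parameters and the trace strings are hypothetical, and each state is assumed to take exactly one cycle in both configurations.

```python
def run_tag_state_machine(address, l2_installed, l2_tag_lookup, l1_directory_search):
    """Step through tacc1 -> tacc2 -> tcmpr and return (status, trace)."""
    trace = []

    # tacc1[0]: accept the command and launch the lookup.
    if l2_installed:
        trace.append("tacc1: address sent to L2 tag array")
    else:
        trace.append("tacc1: search line active, address sent to L1 directory")

    # tacc2[1]: one cycle of lookup latency in either configuration.
    if l2_installed:
        pending = l2_tag_lookup(address)
        trace.append("tacc2: L2 tag data latched")
    else:
        pending = l1_directory_search(address)
        trace.append("tacc2: L1 directory accessed")

    # tcmpr[2]: status becomes available to the rest of the controller.
    source = "L2 tag compare" if l2_installed else "L1 search status"
    trace.append(f"tcmpr: status from {source}")
    return pending, trace
```

Both configurations produce a three-entry trace, which reflects the point that the L1 status arrives with the timing an L2 tag access would have had.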
[0121] A system with an L2 cache in the preferred
embodiment implements a 128-byte L2 cache 104 line
size and a 64-byte data cache 400 line size. All memory
coherence protocols operate on the L2 cache line size
of 128 bytes. In a system with L2 cache not installed, a
configuration bit is changed such that all memory
coherence protocols operate on the L1 cache line size of 64
bytes. In accordance with another embodiment, in order
to maintain cache coherency in a mixed system having
processors 100, 100A with L2 104, 104A installed and
processors 100B without L2 installed, all on the same
system bus 102, the L1 and L2 cache line sizes are kept
the same.
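
The effect of the configuration bit on the coherence granule can be shown with simple address arithmetic. The Python sketch below is illustrative; the constant and function names are hypothetical, and it assumes the granule is selected purely by whether the L2 cache is installed, using the 128-byte and 64-byte line sizes stated above.

```python
L1_LINE_BYTES = 64    # data cache 400 line size
L2_LINE_BYTES = 128   # L2 cache 104 line size


def coherence_granule(l2_installed):
    """Return the line size on which coherence protocols operate."""
    return L2_LINE_BYTES if l2_installed else L1_LINE_BYTES


def granule_base(address, l2_installed):
    """Align an address down to the start of its coherence granule."""
    size = coherence_granule(l2_installed)
    return address & ~(size - 1)
```

For address 0x1C4, the granule starts at 0x180 with the L2 cache installed and at 0x1C0 without it.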

[0122] Data cache directory 440 has a single read port
in the preferred embodiment and as a result, search
operations 575 must present a busy condition to the
processor so that cache directory 440 access can occur. In
certain circumstances, a search request encounters a
processor operation in progress that requires access to
data cache directory 440. In this circumstance a busy is
presented to the L2 cache controller 118 on status lines
483 to indicate that the search request cannot be
honoured immediately. Once the busy drops to the L2 cache
controller, the search operation is initiated.
[0123] An error can also be detected in the data cache
directory 440 while a search operation 575 is in
progress. In this case, search response status 483
indicates that the search operation should be retried. If the
search request results from a system bus operation,
then the L2 cache controller will issue a response on the
system bus that the operation should be retried.
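
The busy and retry handling described in the last two paragraphs can be summarised as follows. This Python sketch is a simplified illustration, not the actual control logic; the function `resolve_search` and the labels `"bus_retry"`, `"internal_retry"` and `"pending"` are hypothetical, and it assumes the status lines are sampled once per cycle.

```python
def resolve_search(responses, from_system_bus):
    """Walk per-cycle status values and return (outcome, cycle).

    responses: iterable of status strings sampled from the status lines,
    one per cycle, e.g. "busy", "retry", "hit", "miss".
    """
    for cycle, status in enumerate(responses):
        if status == "busy":
            # Directory port is in use by the processor: hold the request.
            continue
        if status == "retry":
            # Directory error: a bus-originated search is answered with a
            # retry response on the system bus; other searches are simply
            # retried (label is an assumption).
            outcome = "bus_retry" if from_system_bus else "internal_retry"
            return outcome, cycle
        return status, cycle
    # Busy never dropped within the observed cycles.
    return "pending", None
```

For example, a response sequence of busy, busy, hit resolves as a hit in the third sampled cycle, while a retry indication on a bus-originated search yields a retry response for the system bus.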

Advantages over the Prior Art 

[0124] It is an advantage of the invention according to
the preferred embodiment that a method and system are
provided for achieving data cache consistency in a
multiprocessor or uniprocessor system including either
entry level processors having no L2 cache, or high level
processors including an L2 cache.
[0125] It is a further advantage of the system that such
is achieved with a common microprocessor chip design.
[0126] It is still a further advantage of the system that
a processor design including provision for control of both
an L1 cache and an installed L2 cache properly
functions when the L2 cache is not installed.

Alternative Embodiments 

[0127] It will be appreciated that, although specific 
embodiments of the invention have been described 
herein for purposes of illustration, various modifications 
may be made without departing from the spirit and 
scope of the invention. 

Claims 

1. A cache controller for the management of contents 
of a single level of cache, comprising: 

a second level cache controller including a first 
level cache snooping path; and 

a first system interface; wherein 

said second level cache controller is operable 
in the absence of a second level cache for re- 
flecting bus snoop requests received at said in- 
terface to said first level cache snooping path. 

2. The cache controller of claim 1, wherein

the second level cache controller includes status
generation logic for generating a first status
signal; the cache controller further comprising:

a first level cache controller for generating and
feeding to said status generation logic a second
status signal consistent with timing required for
support of a second level cache directory access;

whereby said second level cache controller is
enabled to operate and respond as though an
access of a second level cache directory had
occurred.

3. The cache controller of claim 2, wherein said first
level cache controller is responsive to a snoop request
forwarded from said second level cache controller
for conducting a search operation to generate
said second status signal.

4. The cache controller of claim 3, wherein said first
system interface is an interface to a system bus and
responses to bus snoop requests are directed to
said system bus for delivery selectively to another
processor or to another device on said system bus.

5. The cache controller of claim 3, said search operation
comprising an explicit search.

6. The cache controller of claim 5, further comprising:

a data cache directory;

said second level cache controller being further
operable for executing said explicit search by
interrogating said data cache directory with a
data cache snoop address and then checking
said second status signal.

7. The cache controller of claim 6, wherein first system
interface is a system bus interface, further comprising:

a processor interface unit selectively operable
for generating a bus snoop request;

a tag state machine;

said second level cache controller being further
operable responsive to said bus snoop request
for

executing said explicit search to determine if
data associated with said bus snoop request
resides in a cache line stored in said first level
cache and selectively generate said second
status signal as a cache miss or cache hit signal;
and

initiating operation of said tag state machine for
querying said second status signal, and:

responsive to generation of a first level cache
miss signal, returning status to said system bus
interface; and

responsive to generation of a first level cache
hit signal, causing a separate L1 snoop operation
to said first level cache to flush out any
modified data, and forwarding said modified data
to said system bus.

8. A cache controller according to any preceding
claim, wherein said first system interface is a system
bus interface, and wherein the cache controller
includes a processor interface; and wherein

said second level cache controller is operable
in the absence of a second level cache for reflecting
to said first level cache snooping path processor
snoop requests received at said processor interface
and bus snoop requests received at said system
bus interface.

9. The cache controller of claim 3, wherein said first
level cache controller is further operable to present
a busy condition to said processor during search
operations.

10. The cache controller of claim 9, said second level
cache controller being responsive to a busy signal
from said first level cache controller indicating a
processor operation in progress that requires access
to a data cache directory for holding said
search operation pending until said busy signal
terminates and thereafter for initiating said search
operation.

11. The cache controller of claim 3, said second level 
cache controller being further operable responsive 
to an error being detected during said search oper- 
ation for causing said search operation to be retried. 

12. The cache controller of claim 3, said search operation
selectively comprising an explicit search or an
implicit search.

13. The cache controller of claim 12, further comprising: 

a data cache directory; 

said second level cache controller being further 
operable for executing said explicit search by 
interrogating said data cache directory with a 
data cache snoop address and then checking 
said second status signal; and 

said second level cache controller being further 
operable for executing said implicit search by 
checking said second status signal. 

14. The cache controller of claim 13, further comprising: 

an address translation unit selectively operable 
for generating an address translation page ta- 
ble fetch request; 

a tag state machine;

said second level cache controller being further
operable responsive to said fetch request for

executing said explicit search to determine if
said page table resides in a cache line stored
in said first level cache and selectively generate
said second status signal as a cache miss or
a cache hit signal; and

initiating operation of said tag state machine for
querying said second status signal, and:

responsive to generation of a first level cache
miss signal, fetching said page table from main
memory; and

responsive to generation of a first level cache
hit signal, causing a separate L1 snoop operation
to said first level cache to flush out any
modified data.



15. The cache controller of claim 13, further comprising: 

a data cache unit selectively operable for issuing
a data request due to a data cache miss;

a tag state machine;

said second level cache controller being further
responsive to said data request for initiating
operation of said tag state machine for querying
said second status signal and, responsive to
said second status signal indicating a data
cache miss, for initiating a main memory access
for the requested data.

16. The cache controller of claim 13, further comprising:

a data cache unit selectively operable responsive
to said first level cache having a cache line
in a shared state and a store pending to that
cache line for issuing a Dclaim request for
exclusive access to said cache line;

a tag state machine;

said second level cache controller being operable
for initiating operation of said tag state machine
querying said second status signal and,
responsive to said second status indicating
shared status, for initiating a Dclaim.

17. A cache controller according to any one of claims 1
to 3 wherein the first system interface is an interface
to a processor; and wherein

said second level cache controller is operable
in the absence of a second level cache for reflecting
processor snoop requests received at said interface
to said first level cache snooping path.

18. The cache controller of claim 3, wherein responses 
to processor snoop requests result in commands 
that are selectively directed to a system bus for de- 
livery to main memory or another processor, or to 
said first level cache controller for performing a sep- 
arate snoop operation. 

19. A method for managing contents of a single level 
cache, comprising the steps of: 

reflecting second level cache snoop requests 
received at a second level cache controller to 
a first level cache controller cache snooping 
path; 

responsive to a reflected snoop request, oper- 
ating said first level cache controller to gener- 
ate and feed to said second level cache con- 
troller a status signal consistent with timing re- 
quired for support of a second level cache di- 
rectory access. 

20. Cache coherency apparatus for a system including 
one or more processor nodes, a system bus inter- 
connecting said processor nodes and other system 
devices, with all processor nodes operating without 
an L2 cache installed, the cache coherency appa- 
ratus at said one processor node comprising: 

a second level cache controller including a first 
level cache snooping path; 

a processor interface to said second level 
cache controller; 

said second level cache controller including 
status generation logic for responding with a 
first status signal to a processor snoop request 
message on said processor interface directed 
to a second level cache; 

means for reflecting said processor snoop re- 
quest to said first level cache snooping path; 

a first level cache controller operable respon- 
sive to said reflected processor snoop request 
for generating and feeding to said status gen- 
eration logic a second status signal consistent 
with timing required for support of a second lev- 
el cache directory access; and 

said second level cache controller further being 
responsive to said second status signal for re- 
placing said first status signal. 

21. Cache coherency apparatus for a system including
one or more processor nodes, a system bus
interconnecting said processor nodes and other system
devices, with all processor nodes operating without
an L2 cache installed, the cache coherency apparatus
at said one processor node comprising:

a second level cache controller including a first
level cache snooping path;

said second level cache controller including
status generation logic for responding with a
first status signal to a bus snoop request message
on said system bus directed to a second
level cache;

means for reflecting said bus snoop request to
said first level cache snooping path;

a first level cache controller operable responsive
to said reflected bus snoop request for
generating and feeding to said status generation
logic a second status signal consistent with
timing required for support of a second level
cache directory access; and

said second level cache controller further being
responsive to said second status signal for
replacing said first status signal.

22. Cache coherency apparatus for a system including
a plurality of processors, a system bus interconnecting
said processors, and at least one processor
node without an L2 cache installed, the cache
coherency apparatus at said processor node comprising:

a second level cache controller including a first
level cache snooping path;

a processor interface to said second level
cache controller;

said second level cache controller including
status generation logic for responding with a
first status signal to a bus snoop request on
said system bus directed to a second level
cache or to a processor snoop request on said
processor interface;

means for reflecting said bus snoop request or
said processor snoop request to said first level
cache snooping path;

a first level cache controller operable responsive
to said reflected bus snoop request or
processor snoop request for generating and
feeding to said status generation logic a second
status signal consistent with timing required for
support of a second level cache directory access;
and

said second level cache controller further being
responsive to said second status signal for
replacing said first status signal.